Text mining is a semi-automated process of knowledge discovery in textual data. The goal is to move from an unorganized state to an organized state.
There are two types:
Use-cases: Web Mining, Sentiment Analysis, …
| NLP | Text Mining |
|---|---|
| understands text and language | finds patterns |
| often uses deep learning and tools such as spaCy (an open-source library for natural language processing in Python) | uses feature hashing and distance measures together with machine-learning methods |
| also covers audio | deals with text only |
| can translate | can only reveal relationships (easier to evaluate) |
| much broader applications than spam detection or sentiment analysis | certain NLP steps can be part of text-mining preprocessing |
A Term-Document Matrix (TDM) and a Document-Term Matrix (DTM) are essentially transposes of each other.
Important are the type of words and their order; one word can have three associated features:
The bag-of-words approach treats every word as a single token in the sentence, regardless of type or order.
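As a quick, language-agnostic sketch of these ideas (Python, with a made-up two-document corpus): a bag-of-words count per document gives the DTM, and the TDM is just its transpose.

```python
# Build a Document-Term Matrix (rows = documents, columns = terms)
# from a tiny corpus, then transpose it to get the Term-Document Matrix.
docs = ["the cow jumps over the moon", "the moon is bright"]

# Bag of words: every word is a token; type and order are ignored.
vocab = sorted({w for d in docs for w in d.split()})
dtm = [[d.split().count(t) for t in vocab] for d in docs]

# The TDM is simply the transpose of the DTM.
tdm = [list(col) for col in zip(*dtm)]

print(vocab)
print(dtm)
```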
(= word embeddings)
Analyze how words are embedded in text: in what context do words appear? Word embedding is thus a technique for identifying similarities between words by using a model to predict the co-occurrence of words. By examining the adjacency of words in a model-created space, such models can complete analogies such as “Man is to woman as king is to queen.”
Word embeddings are created by identifying the words that occur within a “context window” (= words before and after a center word). These words will be used to train a word embedding model. Center word + context words = vector of numbers for presence or absence of unique words within a dataset.
| Continuous Bag of Words (CBOW) Model | Skip-Gram Model |
|---|---|
| reads in the context window words and tries to predict the most likely center word | predicts the context words given the center word |
| useful in practical applications such as predictive web search | most useful for identifying patterns within texts in order to represent them in multidimensional space |
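The two training setups can be made concrete with a small Python sketch (hypothetical sentence, window of one word on each side): CBOW pairs map the context words to the center word, skip-gram pairs map the center word to each context word.

```python
tokens = "man is to woman as king is to queen".split()
window = 1  # number of context words on each side of the center word

cbow_pairs = []      # (context words) -> center word
skipgram_pairs = []  # center word -> one context word
for i, center in enumerate(tokens):
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    cbow_pairs.append((tuple(context), center))
    for c in context:
        skipgram_pairs.append((center, c))

print(cbow_pairs[1])      # (('man', 'to'), 'is')
print(skipgram_pairs[:3])
```

A real word2vec model then trains a small neural network on these pairs; the learned weight vectors are the embeddings.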
Uses the one-token-per-row principle: a token is a meaningful unit of text (a paragraph, n-gram, word, or subword). This principle distinguishes tidytext from other approaches:
library(janeaustenr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
library(stringr)
text <- c("Because I could not stop for Death -",
"He kindly stopped for me -",
"The Carriage held but just Ourselves -",
"and Immortality")
text_df <- tibble(line = 1:4, text = text)
text_df %>%
unnest_tokens(word, text)
N-gram modeling is an alternative to word embedding techniques. An n-gram is a sequence of n successive items in a text document, which may include words, numbers, symbols, and punctuation. It is used to convert text from an unstructured format into a structured format.
For example, take the sentence “The cow jumps over the moon”. With N=2 (known as bigrams), the n-grams would be: the cow, cow jumps, jumps over, over the, the moon.
So there are 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to jumps->over, etc., essentially moving one word forward to generate the next bigram.
With N=3 (trigrams), the n-grams would be: the cow jumps, cow jumps over, jumps over the, over the moon.
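A minimal Python sketch of n-gram generation reproduces both lists for this example sentence:

```python
def ngrams(tokens, n):
    """Slide a window of size n over the tokens, one token at a time."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cow jumps over the moon".split()
print(ngrams(tokens, 2))  # 5 bigrams: the cow, cow jumps, jumps over, over the, the moon
print(ngrams(tokens, 3))  # 4 trigrams: the cow jumps, cow jumps over, jumps over the, over the moon
```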
Difference between the n-gram model and the skip-gram model (answer from ChatGPT):
While Skip-gram and N-gram models are both used in text mining and natural language processing, they are different in their approaches and purposes.
N-gram Model: This model predicts the occurrence of a word based on the probability of it appearing given the previous N words. For instance, in a bigram model (N=2), the probability of a word is predicted based on the previous word. It’s a form of probabilistic language model that calculates the probability of a word given its previous N words.
Skip-gram Model: Skip-gram is a type of word embedding model used to represent words in a continuous vector space. It’s part of the word2vec framework developed by Google. Unlike N-grams, the skip-gram model is not focused on predicting words sequentially. Instead, it tries to learn representations of words that are useful for predicting the context words within a certain window around the target word.
In summary, while both are used in text mining, N-gram models focus on predicting sequential words, whereas skip-gram models focus on learning distributed representations of words in a continuous vector space, often used for tasks like word similarity and language modeling.
“words” (default): individual words only
“characters”: individual characters only
“character_shingles”: like n-grams, but with characters
“ngrams”: several words treated as one unit
text_df %>% unnest_tokens(ngram, text, token = "ngrams", n = 2)
“skip_ngrams”
text_df %>% unnest_tokens(skip_ngrams, text, token = "skip_ngrams", n = 2)
“regex”: splits on a regular expression
“tweets”: word tokenization that preserves usernames, hashtags and URLs
“ptb” (Penn Treebank): “This tokenizer uses regular expressions to tokenize text similar to the tokenization used in the Penn Treebank. It assumes that text has already been split into sentences. The tokenizer does the following:
What happens here?
library(dplyr)
original_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
ungroup()
original_books
austen_books() contains 6 novels by Jane Austen as a tidy data frame with two columns: text, which contains the text of the novels divided into elements of up to about 70 characters each, and book, which contains the titles of the novels as a factor in order of publication.
Two columns are then added: the row number is saved as linenumber, and the cumulative count of lines matching the chapter regex as chapter.
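The cumsum(str_detect(...)) trick can be mimicked in Python to make the chapter numbering concrete (made-up lines of text; the regex is the one from the R code above):

```python
import re
from itertools import accumulate

lines = ["CHAPTER I", "It is a truth universally acknowledged,",
         "that a single man in possession of a good fortune,",
         "Chapter 2", "more text"]

# str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE))
is_heading = [bool(re.match(r"^chapter [\divxlc]", l, re.IGNORECASE)) for l in lines]
# cumsum: each line is assigned the number of chapter headings seen so far
chapter = list(accumulate(map(int, is_heading)))
print(chapter)  # [1, 1, 1, 2, 2]
```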
What happens here?
library(tidytext)
tidy_books <- original_books %>%
unnest_tokens(word, text)
tidy_books
Every word is put on its own row, along with the book title, line number and chapter number.
data(stop_words)
tidy_books <- tidy_books %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
tidy_books %>%
count(word, sort = TRUE)
library(ggplot2)
tidy_books %>%
count(word, sort = TRUE) %>%
filter(n > 600) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word)) +
geom_col() +
labs(y = NULL)
stop_words is a data frame from tidytext with 1149 English stop words (about, above, after, etc.).
anti_join filters rows based on the presence or absence of matches.
The remaining words are then counted, and plotted if they occur more than 600 times.
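The anti_join step can be illustrated with a small Python analogue (made-up token list and stop list; tidytext's actual stop_words list is much larger):

```python
from collections import Counter

tokens = ["the", "carriage", "held", "but", "just", "ourselves", "and", "immortality"]
stop_words = {"the", "but", "just", "and", "a", "of"}  # stand-in for tidytext's stop_words

# anti_join: keep only the tokens that have NO match in the stop-word list
kept = [t for t in tokens if t not in stop_words]
counts = Counter(kept).most_common()  # analogous to count(word, sort = TRUE)
print(kept)    # ['carriage', 'held', 'ourselves', 'immortality']
print(counts)
```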
library(wordcloud)
## Loading required package: RColorBrewer
set.seed(123)
word_counts <- tidy_books %>%
count(word, sort = TRUE)
wordcloud(word_counts$word, word_counts$n,
min.freq = 40, max.words = 100)
library(pdftools)
## Using poppler version 23.04.0
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
all_pdfs <- list.files(pattern = "06_BusinessIintelligence_Text Mining_HS23.pdf$")
sw <- get_stopwords(language = "de", source = "snowball")
map_df(all_pdfs, ~ tibble(txt = pdf_text(.x)) %>%
mutate(filename = .x) %>%
unnest_tokens(word, txt)) %>%
anti_join(stop_words) %>%
anti_join(sw) %>%
filter(!grepl("[0-9]", word)) -> mypdf
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
library(wordcloud2)
wordcloud2(size=.4,mypdf %>%
count(word,sort = TRUE)
)
Find out what word embeddings are, and describe in a few sentences what they are and what they have in common with PCA components. https://cbail.github.io/textasdata/word2vec/rmarkdown/word2vec.html
Solution:
Embedding models analyze the context of words in a sentence or text. A center word is chosen, and the words appearing before and after this center are examined (the context window). The words in the context window can be represented as vectors. There are two methods for creating word embeddings: the Continuous Bag of Words (CBOW) model and the Skip-Gram model.
With CBOW, the model tries to predict the center word from the words of the context window.
With Skip-Gram it is exactly the other way round: the context words are predicted from the center word.
The CBOW model can arguably be understood like a PCA: the dimensions are reduced down to a single word that is most strongly associated with the input words, i.e. that embeds best among them.
Describe what skip_ngrams does in the unnest_tokens function and what it could be useful for.
Solution: It goes through all the words in order and outputs each word first on its own, then together with the following word, and then with the word after next while leaving out the one in between (skip). It thus keeps adding a word and then dropping it again. With n=2 this yields 3 items per word, with n=3 already 7, and with n=4 it is 15.
The benefit could be that the detection of relationships is not hindered by filler words (and, but, etc.).
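The pattern described for n = 2 can be sketched in Python; this mimics the described output (3 items per word, fewer near the end of the text), not necessarily the exact tokenizers-package implementation:

```python
def skip_ngrams_n2(tokens):
    """For each word emit: the word itself, the bigram with the next word,
    and the skip-bigram with the word after next (up to 3 items per word)."""
    out = []
    for i, w in enumerate(tokens):
        out.append(w)
        if i + 1 < len(tokens):
            out.append(f"{w} {tokens[i + 1]}")
        if i + 2 < len(tokens):
            out.append(f"{w} {tokens[i + 2]}")
    return out

print(skip_ngrams_n2("he kindly stopped for me".split()))
```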
Create a word cloud for at least 10 PDF documents in one of your folders with the help of the R script tm-sent.R from the materials.
all_pdfs <- list.files(pattern = "*.pdf$", recursive = TRUE) # with recursive, all subfolders get searched (in this case a folder with bwl course slides [16 pdf files])
sw <- get_stopwords(language = "de", source = "snowball")
weglassen <- c("école", "haute", "applied", "bern", "berner","bernoise","sciences", "university","spécialisée", "fachhochschule")
map_df(all_pdfs, ~ tibble(txt = pdf_text(.x)) %>%
mutate(filename = .x) %>%
unnest_tokens(word, txt)) %>%
anti_join(stop_words) %>%
anti_join(sw) %>%
filter(!word %in% weglassen) %>%
filter(!grepl("[0-9]", word)) -> mypdf
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
library(wordcloud2)
# wordcloud2(size=.4, mypdf %>% count(word, sort = TRUE)) # uncomment to generate the interactive wordcloud
# Below is png of generated wordcloud, since only one wordcloud per rmarkdown.html can be present.
library(textdata)
get_sentiments("afinn")
get_sentiments("bing")
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 131015 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
jane_austen_sentiment
library(ggplot2)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
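The core of this sentiment pipeline (join words to a lexicon, tally positives and negatives per 80-line chunk, take the difference) can be sketched in Python with a tiny made-up lexicon and token list:

```python
# (linenumber, word) pairs as produced by unnest_tokens;
# the lexicon stands in for the "bing" sentiment lexicon
lexicon = {"kindly": "positive", "death": "negative", "miss": "negative", "good": "positive"}
tokens = [(1, "death"), (2, "kindly"), (90, "good"), (95, "miss"), (99, "death")]

chunk_size = 80  # index = linenumber %/% 80
net = {}
for line, word in tokens:
    if word in lexicon:                      # inner_join with the lexicon
        idx = line // chunk_size
        delta = 1 if lexicon[word] == "positive" else -1
        net[idx] = net.get(idx, 0) + delta   # sentiment = positive - negative

print(net)  # {0: 0, 1: -1}
```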
From https://www.sciencedirect.com/science/article/pii/S0933365715000299?via%3Dihub (Sentiment analysis in medical settings: New opportunities and challenges, Artificial Intelligence in Medicine, Volume 64, Issue 1, 2015, Kerstin Denecke, Yihan Deng):
| Entity | Possible sentiment values |
|---|---|
| Health status | Improve, worsen |
| Medical condition | Present, improve, worsen |
| Diagnosis | Certain, uncertain, preliminary |
| Effect of a medical event | Critical, non-critical |
| Medical procedure | Positive or negative outcome, successful or unsuccessful |
| Medication | Helpful, useless, serious adverse events |
Tokenize reviews.csv from the materials in the same way that original_books <- austen_books() was tokenized into tidy_books in the section “Text Mining mit TidyText” of the R script tm-sent.R. Then apply the sentiment analysis from tm-sent.R to reviews.csv. How do you interpret the result?
library(tidyr)
library(dplyr)
library(tidytext)
library(stringr)
library(textdata)
original_reviews <- as_tibble(read.csv("./Materialien-20231126/reviews.csv"))
tidy_reviews <- original_reviews %>%
unnest_tokens(word, sentence)
tidy_reviews
reviews_sentiment <- tidy_reviews %>%
inner_join(get_sentiments("bing")) %>%
count(sentence_index, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
reviews_sentiment
library(ggplot2)
ggplot(reviews_sentiment, aes(sentence_index, sentiment, fill = sentence_index)) +
geom_col(show.legend = FALSE)
From Chris Bail, PhD, Duke University (https://cbail.github.io/textasdata/word2vec/rmarkdown/word2vec.html):
“So what are Word Embeddings Useful for? Unfortunately, many of us are still trying to work this out. They have obvious applications for industry (especially in the area of search and information retrieval), but it is less clear how they can be used for empirical studies.”